{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# LoC Data Package Tutorial: United States Elections, Web Archives Data Package\n", "\n", "version 2.0\n", "\n", "This notebook demonstrates basic usage of Python for interacting with [data packages from the Library of Congress](https://data.labs.loc.gov/packages/) via the [United States Elections, Web Archives Data Package](https://data.labs.loc.gov/us-elections/), which is derived from the Library's [United States Elections Web Archive](https://www.loc.gov/collections/united-states-elections-web-archive/). We will:\n", "\n", "1. [Output data package summary](#Output-data-package-summary)\n", "2. [Query the metadata in the data package](#Query-the-metadata-in-the-data-package)\n", "3. [Filter and download CDX index files, analyze text](#Filter-and-download-CDX-index-files,-analyze-text)\n", "\n", "## Prerequisites\n", "\n", "In order to run this notebook, please follow the instructions listed in [this directory's README](https://github.com/LibraryOfCongress/data-exploration/blob/master/Data%20Packages/README.md)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Output data package summary\n", "\n", "First, we will select [United States Elections, Web Archives Data Package](https://data.labs.loc.gov/us-elections/) and output a summary of its files." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

| | FileType | Count | Size |
|---|---|---|---|
| 0 | .gz | 394,950 | 227.8GB |


| | item_id | item_title | website_url | website_id | website_scopes | collection | website_elections | website_parties | website_places | website_districts | website_thumbnail | website_start_date | website_end_date | item_all_years | website_all_years | mods_url | access_condition |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3460 | http://www.loc.gov/item/lcwaN0002501/ | Official Campaign Web Site - Cris Ericson | http://www.usmjp.com/ | 3415 | [http://crisericson.com, http://vermontnews.livejournal.com, http://www.myspace.com/usmjp2010, http://crisericson2010.blogspot.com] | United States Elections, 2012 | [United States. Congress. Senate, Vermont. Governor] | [U.S. Marijuana Party, Independent candidates] | [Vermont, Vermont] | [None, None] | http://cdn.loc.gov/service/webcapture/project_1/thumbnails/lcwaS0003415.jpg | 20121003 | 20121019 | [2002, 2004, 2004, 2006, 2008, 2010, 2012, 2012, 2018, 2018] | [2012] | https://tile.loc.gov/storage-services/service/webcapture/project_1/mods/united-states-elections-web-archive/lcwaN0002501.xml | None |


| | urlkey | timestamp | original | mimetype | statuscode | digest | redirect | metatags | file_size | offset | warc_filename |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | com,voter)/home/candidates/info/0,1214,2-11880-,00.html | 20001002182124 | http://www.voter.com:80/home/candidates/info/0,1214,2-11880-,00.html | text/html | 200 | FYXP43MQC5GVBQMVK3ETWSPXUBR5ICKP | - | - | 5051 | 149 | unique.20010415093936.arc.gz |
| 1 | com,voter)/home/candidates/info/0,1214,2-18885-,00.html | 20001002185814 | http://www.voter.com:80/home/candidates/info/0,1214,2-18885-,00.html | text/html | 200 | H6QN5ZULJ6YZP756QNVM3YXKXC7HZUIL | - | - | 4829 | 5200 | unique.20010415093936.arc.gz |
| 2 | com,voter)/home/candidates/info/0,1214,2-18880-,00.html | 20001002185815 | http://www.voter.com:80/home/candidates/info/0,1214,2-18880-,00.html | text/html | 200 | HFG67JI4KBPHFXMQE5DJRHF3OEKKBOO6 | - | - | 4794 | 10029 | unique.20010415093936.arc.gz |
| 3 | com,voter)/home/officials/general/1,1195,2-2467-,00.html | 20001002185815 | http://voter.com:80/home/officials/general/1,1195,2-2467-,00.html | text/html | 200 | HZJFLTHZD5MGEPJS2WVGBHQRQUPFBE3O | - | - | 5282 | 14823 | unique.20010415093936.arc.gz |
| 4 | com,voter)/home/candidates/info/0,1214,2-18886-,00.html | 20001002185816 | http://www.voter.com:80/home/candidates/info/0,1214,2-18886-,00.html | text/html | 200 | QAM7JW7S4CNYMP6HLA6DASOXTO2SIGWO | - | - | 4823 | 20105 | unique.20010415093936.arc.gz |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1096875 | com,voter)/home/candidates/info/0,1214,2-9118-,00.html | 20001002183052 | http://www.voter.com:80/home/candidates/info/0,1214,2-9118-,00.html | - | - | 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ | - | - | 118 | 145323588 | unique.20010415093936.arc.gz |
| 1096876 | com,voter)/home/candidates/info/0,1214,2-9115-,00.html | 20001002183052 | http://www.voter.com:80/home/candidates/info/0,1214,2-9115-,00.html | - | - | 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ | - | - | 118 | 145323706 | unique.20010415093936.arc.gz |
| 1096877 | com,voter)/home/candidates/info/0,1214,2-15361-,00.html | 20001002182249 | http://www.voter.com:80/home/candidates/info/0,1214,2-15361-,00.html | - | - | 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ | - | - | 119 | 145323824 | unique.20010415093936.arc.gz |
| 1096878 | com,voter)/home/candidates/info/0,1214,2-12994-,00.html | 20001002181842 | http://www.voter.com:80/home/candidates/info/0,1214,2-12994-,00.html | text/html | 404 | UDSH36NBYWO2X73LNMX2LEHLNQ7FYXHZ | - | - | 351 | 145323943 | unique.20010415093936.arc.gz |
| 1096879 | None | None | None | None | None | None | None | None | None | None | None |

1096880 rows × 11 columns

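The table above shows the eleven fields of each CDX index record, and its final row is entirely `None`, which would be consistent with a trailing empty line in the index file. The following is a minimal sketch of how such lines could be split into named fields, assuming the fields are space-separated in the column order shown. `CDX_COLUMNS` and `parse_cdx_lines` are hypothetical names used for illustration and are not taken from the notebook.

```python
# Column order as displayed in the table above (assumed to match the file).
CDX_COLUMNS = [
    "urlkey", "timestamp", "original", "mimetype", "statuscode", "digest",
    "redirect", "metatags", "file_size", "offset", "warc_filename",
]


def parse_cdx_lines(text):
    """Split space-separated CDX lines into dicts, skipping non-records."""
    records = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and anything that is not an 11-field record
        # (for example a header line with a different field count).
        if len(fields) != len(CDX_COLUMNS):
            continue
        records.append(dict(zip(CDX_COLUMNS, fields)))
    return records


# One record copied from the table above, followed by a blank line.
sample = (
    "com,voter)/home/candidates/info/0,1214,2-11880-,00.html 20001002182124 "
    "http://www.voter.com:80/home/candidates/info/0,1214,2-11880-,00.html "
    "text/html 200 FYXP43MQC5GVBQMVK3ETWSPXUBR5ICKP - - 5051 149 "
    "unique.20010415093936.arc.gz\n\n"
)
records = parse_cdx_lines(sample)
print(len(records), records[0]["statuscode"])
```

Filtering on field count drops the blank line, so no all-`None` row is produced. All values stay as strings here; converting `file_size` and `offset` to integers is left to the caller.
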

| | urlkey | timestamp | original | mimetype | statuscode | digest | redirect | metatags | file_size | offset | warc_filename |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | com,voter)/home/candidates/info/0,1214,2-11880-,00.html | 20001002182124 | http://www.voter.com:80/home/candidates/info/0,1214,2-11880-,00.html | text/html | 200 | FYXP43MQC5GVBQMVK3ETWSPXUBR5ICKP | - | - | 5051 | 149 | unique.20010415093936.arc.gz |
| 1 | com,voter)/home/candidates/info/0,1214,2-18885-,00.html | 20001002185814 | http://www.voter.com:80/home/candidates/info/0,1214,2-18885-,00.html | text/html | 200 | H6QN5ZULJ6YZP756QNVM3YXKXC7HZUIL | - | - | 4829 | 5200 | unique.20010415093936.arc.gz |
| 2 | com,voter)/home/candidates/info/0,1214,2-18880-,00.html | 20001002185815 | http://www.voter.com:80/home/candidates/info/0,1214,2-18880-,00.html | text/html | 200 | HFG67JI4KBPHFXMQE5DJRHF3OEKKBOO6 | - | - | 4794 | 10029 | unique.20010415093936.arc.gz |
| 3 | com,voter)/home/officials/general/1,1195,2-2467-,00.html | 20001002185815 | http://voter.com:80/home/officials/general/1,1195,2-2467-,00.html | text/html | 200 | HZJFLTHZD5MGEPJS2WVGBHQRQUPFBE3O | - | - | 5282 | 14823 | unique.20010415093936.arc.gz |
| 4 | com,voter)/home/candidates/info/0,1214,2-18886-,00.html | 20001002185816 | http://www.voter.com:80/home/candidates/info/0,1214,2-18886-,00.html | text/html | 200 | QAM7JW7S4CNYMP6HLA6DASOXTO2SIGWO | - | - | 4823 | 20105 | unique.20010415093936.arc.gz |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 338148 | org,ctgop)/county/tolland.htm | 20001006073643 | http://www.ctgop.org:80/county/tolland.htm | - | - | 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ | - | - | 101 | 79251104 | unique.20010415101811.arc.gz |
| 338149 | org,ctgop)/county/tolland.htm | 20001005073549 | http://www.ctgop.org:80/county/tolland.htm | - | - | 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ | - | - | 101 | 79251205 | unique.20010415101811.arc.gz |
| 338150 | org,ctgop)/county/tolland.htm | 20001004073505 | http://www.ctgop.org:80/county/tolland.htm | - | - | 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ | - | - | 101 | 79251306 | unique.20010415101811.arc.gz |
| 338151 | org,ctgop)/county/tolland.htm | 20001003073437 | http://www.ctgop.org:80/county/tolland.htm | text/html | 200 | TIRWMHRDJ5L22TJWCXVA6TNU5YOB65SW | - | - | 1421 | 79251407 | unique.20010415101811.arc.gz |
| 338152 | None | None | None | None | None | None | None | None | None | None | None |

1541579 rows × 11 columns


| | urlkey | timestamp | original | mimetype | statuscode | digest | redirect | metatags | file_size | offset | warc_filename | domains |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 25175 | com,algore2000,search)/search | 20001030063531 | http://search.algore2000.com:80/search/ | - | - | 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ | - | - | 97 | 7471624 | unique.20010415093936.arc.gz | search.algore2000.com |
| 26166 | com,algore2000,search)/search | 20001030053022 | http://search.algore2000.com:80/search/ | - | - | 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ | - | - | 97 | 7587973 | unique.20010415093936.arc.gz | search.algore2000.com |
| 49892 | com,algore2000,search)/search | 20001029053020 | http://search.algore2000.com:80/search/ | - | - | 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ | - | - | 97 | 10612154 | unique.20010415093936.arc.gz | search.algore2000.com |
| 73526 | com,algore2000,search)/search | 20001028053001 | http://search.algore2000.com:80/search/ | - | - | 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ | - | - | 99 | 13619683 | unique.20010415093936.arc.gz | search.algore2000.com |
| 97191 | com,algore2000,search)/search | 20001027053201 | http://search.algore2000.com:80/search/ | - | - | 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ | - | - | 98 | 16632272 | unique.20010415093936.arc.gz | search.algore2000.com |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 336264 | org,keyes2000)/images/newsimage.jpg | 20001003073434 | http://keyes2000.org:80/images/newsimage.jpg | image/jpeg | 200 | LWERVVNORJQ6IBZCJ4SBNH26JU6NH3MV | - | - | 13527 | 76178594 | unique.20010415101811.arc.gz | keyes2000.org |
| 336611 | com,algore2000)/briefingroom/releases/pr_091300_gore_wins_4.html | 20001004075816 | http://www.algore2000.com:80/briefingroom/releases/pr_091300_Gore_Wins_4.html | - | - | 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ | - | - | 130 | 76906140 | unique.20010415101811.arc.gz | algore2000.com |
| 336612 | com,algore2000)/briefingroom/releases/pr_091300_gore_wins_4.html | 20001004073516 | http://algore2000.com:80/briefingroom/releases/pr_091300_Gore_Wins_4.html | - | - | 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ | - | - | 127 | 76906270 | unique.20010415101811.arc.gz | algore2000.com |
| 336613 | com,algore2000)/briefingroom/releases/pr_091300_gore_wins_4.html | 20001003075840 | http://www.algore2000.com:80/briefingroom/releases/pr_091300_Gore_Wins_4.html | - | - | 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ | - | - | 130 | 76906397 | unique.20010415101811.arc.gz | algore2000.com |
| 336614 | com,algore2000)/briefingroom/releases/pr_091300_gore_wins_4.html | 20001003073434 | http://algore2000.com:80/briefingroom/releases/pr_091300_Gore_Wins_4.html | text/html | 200 | 6Y6BX6SUDNF5CASBJH2LASINQ46ASMQF | - | - | 9606 | 76906527 | unique.20010415101811.arc.gz | algore2000.com |

6448 rows × 12 columns

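In the table above, the `domains` values look like the host part of `original` with the port and any leading `www.` removed (for example `http://www.algore2000.com:80/...` becomes `algore2000.com`). The sketch below reproduces that mapping with the standard library. `extract_domain` is a hypothetical helper, and the rule it applies is inferred from the displayed rows, not taken from the notebook's own code.

```python
from urllib.parse import urlparse


def extract_domain(url):
    """Return the host of a URL without its port or a leading 'www.'."""
    # urlparse(...).hostname is lowercased and excludes the port.
    host = urlparse(url).hostname or ""
    if host.startswith("www."):
        host = host[len("www."):]
    return host


# URLs copied from the `original` column of the table above.
print(extract_domain("http://search.algore2000.com:80/search/"))
print(extract_domain("http://www.algore2000.com:80/briefingroom/releases/pr_091300_Gore_Wins_4.html"))
print(extract_domain("http://keyes2000.org:80/images/newsimage.jpg"))
```

With a pandas DataFrame of CDX records (here assumed to be called `df`), the same helper could be applied as `df["original"].map(extract_domain)` to build a comparable column.
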
\n", "